[57] ABSTRACT

A method and apparatus for identifying any one of a
plurality of words, each word having an audible form
represented by a sequence of spoken speech elements, with
each speech element having a respective position in the
sequence, which involves: receiving spoken speech elements
of a word and interpreting each received speech element,
wherein each spoken speech element α may be interpreted
as any one of a plurality of different speech elements β,
one of the speech elements β being the same as speech
element α; assigning to each of the possible speech
elements a respective plurality of probabilities, P_αβ, that
the speech element will be interpreted as a speech element β
when a speech element α has been spoken; storing data
representing each word, the data for each word including
identification of each speech element in the word and
identification of the respective position of each speech
element in the sequence of speech elements representing the
word; receiving a sequence of speech elements spoken by a
person and representing one of the stored words, and
interpreting each speech element of the spoken word and the
position of each speech element in the sequence of spoken
speech elements; and comparing the interpreted speech
elements with stored data representing each word of the
plurality of words and performing a computation, using the
probability, P_αβ, associated with each interpreted speech
element β, to identify the word whose speech elements
correspond most closely to interpreted speech elements.
METHODS AND APPARATUS FOR IMPROVING THE RELIABILITY OF
RECOGNIZING WORDS IN A LARGE DATABASE WHEN THE WORDS ARE
SPELLED OR SPOKEN

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of application
Ser. No. 999,062, U.S. Pat. No. 5,454,062, filed Dec. 31, [...]
filed Mar. 27, 1991, now U.S. Pat. No. 5,274,560, [...] computer
listings submitted therewith are incorporated herein by reference.

BACKGROUND OF THE INVENTION

It is generally recognized that man-machine interaction
can be enhanced by the ability to communicate audibly, or
orally. A variety of interfaces have been developed, including
input devices which identify spoken words and output
devices which produce synthesized speech. While significant
advances have been made with regard to output devices,
which respond to well-defined signals, input devices have
posed more difficult problems.

Such input devices must convert spoken utterances, i.e.
letters, words, or phrases, into the form of electrical signals,
and must then process the electrical signals to identify the
spoken utterances. By way of example: acoustic signals
constituting spoken utterances may be sampled at fixed
intervals; the pattern formed by a succession of sampled
values may then be compared with stored patterns representing
known spoken utterances; and the known spoken
utterance represented by the stored pattern which matches
the pattern of sampled values most closely is assumed to be
the actual spoken utterance. The input devices which have
already been proposed could, in theory, function with a high
degree of reliability. However, in the present state of the art,
they are operated by programs which entail long processing
times that prevent useful results from being achieved in
acceptably short time periods.

One commercially available program for recognizing
spoken utterances is marketed by Lernout and Hauspie
Speech Products U.S.A., Inc., of Woburn, Mass. under the
product name CSR-1000 Algorithm. This company also
offers a key word spotting algorithm under the product name
KWS-1000 and a text-to-speech conversion algorithm under
the product name TTS-1000. These algorithms are usable on
conventional PCs having at least a high-performance 16 bit
fixed or floating DSP processor and 128 KB of RAM
memory.

The CSR-1000 algorithm is supplied with a basic vocabulary
of, apparently, several hundred words each stored as a
sequence of phonemes. A spoken word is sampled in order
to derive a sequence of phonemes. The exact manner in
which such sequence of phonemes is processed to identify
the spoken word has not been disclosed by the publisher of
the program, but it is believed that this is achieved by
comparing the sequence of phonemes derived from the
spoken word with the sequence of phonemes of each stored
word. This processing procedure is time consuming, which
probably explains why the algorithm employs a vocabulary
of only several hundred words.

It would appear that the CSR-1000 algorithm could be
readily configured to recognize individual spoken letters.

Speech recognition of large isolated word vocabularies of
30,000 words or more requires that the utterances be broken
into phonemes or other articulatory events or, alternatively,
that the user verbally spell the word, in which case, his
utterances of the letters of the alphabet are recognized by
their phoneme content and then the sequence of letters is
used to identify the unknown word. In any large vocabulary
system, both methods are needed to insure accuracy. The
user would first attempt to have the word recognized by
simply saying the word. If this was unsuccessful, the user
would then have the option of spelling the word.

A problem occurs with spelling, however, because the [...]
not [...] recognized by speech recognition systems. For example,
almost all [...] trouble [...]. In fact, most of the alphabet
consists of single syllable utterances which rhyme with some
other utterance, or which, although not rhyming, can be
confused with some other utterance. An example of the latter
would be any pair of the letters "F", "S" and "M", all of which
have the same initial sound. Similarly, many phonemes which
sound alike can be mis-recognized. Hereinafter, references to
letters or phonemes which sound alike are meant to include
those which rhyme and those which can otherwise be
confused with one another. Clearly, it is necessary for a
speech recognition system to deal with the errors caused by
letters or phonemes which sound alike.

Application Ser. No. 999,062 (hereinafter referred to as
the prior application) discloses a method for identifying any
one of a plurality of utterances using a programmed digital
computing system, each utterance having an audible form
representable by a sequence of speech elements, and each
speech element having a respective position in the sequence,
the method comprising: storing, in the computer system, a
digital representation corresponding to each of the plurality
of utterances and assigning a respective identifying designation
to each utterance; creating a table composed of a
plurality of entries, each entry being associated with a
unique combination of a particular speech element and a
particular position in the sequence of speech elements of the
audible form of an utterance and storing in each entry the
identifying designation of each of the plurality of utterances
whose audible form is represented by a speech element
sequence containing the particular speech element at the
particular position with which that entry is associated;
converting an utterance to be identified and spoken by a
person into a sequence of speech elements each having a
respective position in the sequence; reading at least each
table entry associated with a speech element and position
combination corresponding to the combination of a respective
position in the sequence of the spoken utterance and the
particular speech element at the respective position in the
sequence of the spoken utterance; and determining which
identifying designation appears most frequently in the
entries which have been read in the reading step.

To implement that method, a collection of words to be
recognized, i.e. the "vocabulary" of the system, is stored in
a first database, with each word being assigned an identifying
designation, e.g. an identifying number.

As a practical matter, every spoken word is made up of a
string of phonemes, or articulatory events, which occur in a
specific sequence. A few words and a number of letters such
as "e" and "o" may consist of a single phoneme which may
be recognized by a system according to the invention.
Letters will typically consist of one or two phonemes and
may be recognized by a system according to the invention.
A spoken language appears to consist of a defined number
of distinct phonemes, each of which is identifiable by a
specific symbol. It is generally considered that English-language
words contain 50-60 different phonemes and in the
description of the present invention, each phoneme is
assigned a respective numerical value.

Different speakers of a language may pronounce a word
differently so that a given word may be composed of
phoneme strings which differ from one another. To minimize
recognition errors resulting from such speech variations, it is
known to have a selected word spoken by a number of
different speakers, determine the phoneme string for the
word spoken by each speaker, and derive a mean or average
phoneme value for each word portion for which the speakers
produced a number of different phonemes.

In the method described above, account is taken of
possible errors in the interpretation, or recognition, of
individual letters or phonemes by giving a high score to each
word having, in the same position in the sequence of letters
or phonemes, the same letter or phoneme as that interpreted,
or recognized, and a lower score for a word or phoneme
which sounds like the interpreted one. The relative score
values are selected empirically or intuitively and will not
change after the system has been programmed and placed
into use. Therefore, the word recognition success rate will
vary from one user to another, depending on the degree to
which the speech characteristics, i.e. pronunciation, accent,
etc., of the user conform to those programmed into the
interpreter.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide certain
improvements in the method disclosed in the prior application
and in systems for implementing that method.

A more specific object of the invention is to improve the
reliability of word recognition in comparison with the
method disclosed in the prior application.

Another object of the invention is to permit automatic
adaptation of the word recognition procedure to individual
users.

The above and other objects are achieved, according to
the present invention, by a method and apparatus for
identifying any one of a plurality of words using a programmed
digital data processing system, each word having an audible
form represented by a sequence of spoken speech elements,
with each speech element having a respective position in the
sequence, the digital data processing system being connected
to means for receiving spoken speech elements of a
word and interpreting each received speech element,
wherein there is a plurality of possible speech elements,
each spoken speech element is a speech element α, each
interpreted speech element is a speech element β, and each
spoken speech element α may be interpreted as any one of
a plurality of different speech elements β, one of the speech
elements β being the same as speech element α, word
identification being achieved by:

assigning to each of the possible speech elements a
respective plurality of probabilities, P_αβ, that the speech
element will be interpreted as a speech element β when a
speech element α has been spoken;

storing data representing each word of the plurality of
words, the data for each word including identification of
each speech element in the word and identification of the
respective position of each speech element in the sequence
of speech elements representing the word;

in the means for receiving and interpreting, receiving a
sequence of speech elements spoken by a person and
representing one of the stored words, and interpreting each
speech element of the spoken word and the position of each
speech element in the sequence of spoken speech elements;
and

comparing the interpreted speech elements with stored
data representing each word of the plurality of words and
performing a computation, using the probability, P_αβ,
associated with each interpreted speech element β, to identify the
word of the plurality of words whose speech elements
correspond most closely to interpreted speech elements.

REFERENCE TO APPENDIX

Attached hereto and made a part of the present
specification, is an APPENDIX listing one example of a
program which may be used in the practice of the invention.
This program can be readily supplemented by an average
programmer to deal with letters that sound alike but do not
rhyme.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a block diagram of a system for practicing the
present invention.

FIG. 2 is a block diagram of an embodiment of computer
2 of FIG. 1 for practicing the invention.

FIG. 3 is a diagram illustrating the steps of a word
identification method according to the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram illustrating a system which can
be employed to implement the present invention. The heart
of the system is a conventional, general purpose computer 2,
such as a PC containing a RAM memory 4 having a capacity
of at least 200 KB. Computer 2 is equipped, in the usual
manner, with a keyboard, a monitor and means for connecting
computer 2 to peripheral components.

Associated with computer 2 is a storage device 6, which
may be installed within computer 2. Storage device 6 may
be a hard disk, a floppy disk, a ROM, a PROM, an EPROM,
an EEPROM, a flash memory, an optical disk, etc. One
possibility is to constitute storage device 6 as an optical disk
or compact disk player which could form part of a separate
audio system installed, for example, in an automobile.

Connected to computer 2 are peripheral devices including
a speech interface unit 10 and a speech synthesizer 12. Input
to unit 10 is provided by a microphone 16,
while the output from speech synthesizer 12 is delivered to
a speaker 18. If storage device 6 is constituted by a disk
player in an audio system, speaker 18 may be constituted by
the speaker or speakers of that audio system.

The storage medium of storage device 6 contains the
operating program for recognizing spoken utterances, along
with a first database containing representations of the
utterances to be recognized and an associated identifying
designation for each stored utterance and a second database in the
form of a table composed of a plurality of entries. The
identifying designations provided in the first database are
stored in appropriate entries of the table constituting the
second database in a manner which will be described in
greater detail below.

Speech synthesizer 12 and speaker 18 are connected to
generate and emit spoken utterances constituting prompts
for the user of the system and the recognized versions of
utterances spoken by the user.
The storage medium may be of a type, such as an optical
disk, which can store the utterances in a form that can be
directly reproduced, possibly via an amplifier and/or a
digital/analog converter. In these cases, speech synthesizer
12 can be replaced by such components.

The basic operation of the system is as follows. At the 
start of operation, the portions of the operating program 
which must be resident in memory 4 are loaded therein from 
storage device 6. The operating program portion loaded into 
memory 4 may include that program portion which serves to 
convert spoken utterances into sequences of phonemes, 
which is a capability of the CSR-1000 algorithm referred to 
earlier herein. Then, an utterance spoken by a user of the 
system is picked up by microphone 16 and converted into an 
electrical analog signal which is delivered to interface 10. 
Depending on the program employed to derive sequences of 
phonemes, interface 10 may place the analog signal at a 
suitable voltage level and conduct it to computer 2, or may 
convert the analog signal into the form of digital samples. 
The spoken utterance is converted into a sequence of phonemes
and this sequence is processed, according to the
invention, in order to identify the stored utterance which
corresponds to the spoken utterance. Then, a sequence of
phonemes associated with that stored utterance is conducted 
to speech synthesizer 12 and emitted in audible form by 
speaker 18 to allow the user to verify that the spoken 
utterance was correctly recognized. The computer may then, 
under control of its operating program, generate further 
audible utterances which may be prompts to the user to input 
a further spoken utterance containing certain information or 
may be output information derived from the information 
previously supplied in spoken form by the user. 

According to alternative embodiments of the invention, 
the spoken utterances are letters which spell a word. In this 
case, the identity of each letter is determined by matching its
phoneme or phonemes with stored patterns and the resulting 
sequence of letters constitutes the sequence of speech ele- 
ments which are processed to identify the correct stored 
utterance. 

Furthermore, embodiments of the invention need not
reproduce a stored utterance in audible, or any other, form. 
Instead, the stored utterances may constitute machine 
instructions which correspond to respective spoken utter- 
ances and which act to cause a machine to perform a desired 
operation. 

To cite one non-limiting example, the technique disclosed 
herein may be employed in a navigation system of the type 
disclosed in the above-cited application Ser. No. 675,632, in 
which case the user will be prompted to supply, in spoken 
form, identification of starting and destination points and 
will then be provided with a series of route directions. If the 
spoken information supplied by the user is in the form of 
spellings of starting and destination locations, the system 
may prompt the user to input each successive letter. 

The invention will be described with respect to a gener- 
alized embodiment in which a first database contains data 
identifying words, with each word being assigned an iden- 
tifying number. The standard, or average, phoneme string 
associated with the spoken version of each word is deter- 
mined by conventional procedures. Each phoneme in a 
string is located at a respective position, n, and each distinct 
phoneme is assigned a value m. 

The structure of this first database is illustrated by TABLE 
1, below, which represents a large vocabulary database 
containing K words in which, for each word, there is 
provided a respective identifying number (id#) and data
representing a letter and/or phoneme sequence which can be
used for displaying and/or sounding the word for verification
and/or for locating information about the word in connection
with a specific application.

TABLE 1

 id #    Word letter sequence    Word phoneme sequence
 1
 2
 ...
 101     ALABAMA                 a-l-a-b-a-m-a
 102     ARIZONA                 a-r-i-z-o-n-a
 103     BRAZIL                  b-r-a-z-i-l
 104     CHICAGO                 c-h-i-c-a-g-o
 105     SEATTLE                 s-e-a-t-t-l-e
 106     ATLANTA                 a-t-l-a-n-t-a
 107     ARICONE                 a-r-i-c-o-n-e
 ...
 K

Then a second database is prepared as shown herebelow
in TABLE 2, which shows a large vocabulary database
containing subsets {n,m} of identifying numbers from the
first database for words having the phoneme or letter m at
position n of the phoneme or letter string.

TABLE 2

          n ->
 m        1          2          3          4         ...    N
 1        {1,1}      {2,1}      {3,1}      {4,1}     ...    {N,1}
 2        {1,2}      {2,2}      {3,2}      {4,2}     ...    {N,2}
 ...
 M-1      {1,M-1}    {2,M-1}    {3,M-1}    {4,M-1}   ...    {N,M-1}
 M        {1,M}      {2,M}      {3,M}      {4,M}     ...    {N,M}

Each entry in the second database is a subset (represented by
{n,m}) containing the identifying numbers of all words in
the first database for which phoneme or letter position n
contains phoneme or letter m.
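
The construction of the second database can be illustrated with a short sketch. It is only one possible arrangement, not the program of the APPENDIX: the names FIRST_DATABASE and build_second_database are hypothetical, the words are the seven letter sequences of TABLE 1, and a Python dictionary keyed by (n, m) stands in for the table of subsets {n,m}.

```python
from collections import defaultdict

# First database: id# -> letter sequence, taken from TABLE 1.
FIRST_DATABASE = {
    101: "ALABAMA",
    102: "ARIZONA",
    103: "BRAZIL",
    104: "CHICAGO",
    105: "SEATTLE",
    106: "ATLANTA",
    107: "ARICONE",
}


def build_second_database(first_database):
    """Map each (position n, element m) pair to the set of id#s of the
    words that contain element m at position n."""
    table = defaultdict(set)
    for word_id, elements in first_database.items():
        # Positions are counted from 1, as in the description.
        for n, m in enumerate(elements, start=1):
            table[(n, m)].add(word_id)
    return dict(table)


SECOND_DATABASE = build_second_database(FIRST_DATABASE)

# Subset {1,A}: the id#s of all words whose first letter is A.
print(sorted(SECOND_DATABASE[(1, "A")]))
```

Only the combinations that actually occur are stored in this sketch; a combination that occurs in no word is simply absent, which corresponds to an empty subset.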

In TABLE 2, the total number of phoneme or letter
positions in a string is a maximum of N and the total number
of different phonemes or letters is M. The value of N is
selected to assure that essentially all phonemes or letters of
each word can be accounted for. Below, for the sake of
brevity, reference will be made to phonemes. It should be
understood, however, that as a general rule reference to
letters would be equally appropriate.

The system is further provided with a scoring memory, or 
table, containing a number of locations equal to the number 
of words in the first database; each location is associated 
with a respective word identifying number. 

A spoken word is analyzed in order to derive its characteristic
phoneme string. The string will have N or fewer
phonemes and each phoneme can have any one of M
different values.

The phoneme value, m, of the spoken word at the first
location in the string (n=1) is identified and for each member
of the subset {1, m}, a score is placed in every scoring
memory location associated with an identifying number in
subset {1, m}. Then the phoneme value, m, of the spoken
word at the second location in the string (n=2) is identified
and, as above, for each member of the subset {2, m}, a score
is placed in every scoring memory location associated with
an identifying number in subset {2, m}. This score will be
added to any score previously placed in any of the scoring
locations associated with an identifying number in subset {2,
m}.

This procedure continues for the entire phoneme string of
the spoken word, or for a string of N phonemes, where N
may be larger or smaller than the number of phonemes in the
string of the spoken word. After the phoneme string has been
processed, the scoring memory is interrogated and the word
whose identifying number corresponds to that of the scoring
memory location which has the highest score is determined
to be the spoken word.

It has been found that this procedure will yield the correct
word a surprisingly high percentage of the time.

Then, the stored data is read from the location of the first
database which has the corresponding identifying number.
The stored data can be used to reproduce the word in spoken
form, as described above.

The system will then await a spoken response from the
user.

In order to reduce the frequency of recognition errors, a
system according to the present invention may select, from
the scoring table, the utterance designations for the utterances
which received the three highest scores, these designations
being delivered in descending order of the scores.
The stored utterance associated with each selected identifying
designation is then delivered to speech synthesizer 12
and emitted in spoken form from speaker 18. After each
synthesized utterance is emitted, the system waits for a
response from the user, e.g. either "yes" or "no". If the user
responds with a "no" after each synthesized utterance is
heard, it is concluded that the recognition process failed, the
scoring memory is cleared, and the user is prompted, if
necessary, to repeat the spoken utterance. It is anticipated
that such failures will be extremely rare.

The storage of utterance identifying designations in the
form of a table, as described above, represents a substantial
improvement over the prior art because it results in a
substantial reduction in the amount of data which must be
processed in order to arrive at an identification of the spoken
utterance. Specifically, for each position in the sequence of
speech elements of a spoken utterance, it is only necessary
to access the table entry associated with that speech element
position and the particular speech element at that position of
the spoken utterance. In other words, to correctly identify
the spoken utterance, it is not necessary to access all of the
entries in the table.

In TABLE 2, the speech elements represented by m may
either be phonemes or letters of the alphabet.
Correspondingly, the speech element positions represented
by n will be either the positions of phonemes in a sequence
or the positions of letters in a word, respectively.

The speech recognition procedure, as described thus far,
is implemented by reading only one entry of the second
database table for each speech element position (n).
However, it has been found that when each spoken utterance
is a word which is converted into a string of phonemes, the
speaker may pronounce the word in such a manner as to add
or delete one or two phonemes. If this should occur, the
spoken word will not be correctly recognized. According to
a further feature of the invention, the probability of achieving
correct recognition is increased by taking into account
the entries associated with a particular phoneme which are
immediately adjacent that entry associated with the correct
position n. For example, referring to TABLE 2 above, if the
phoneme at position n=3 of the spoken word is being
compared with the stored data, and the value of this phoneme
is 2, the identifying numbers in at least subsets {2, 2}
and {4, 2} will additionally be used to include a score in the
scoring memory. Subsets {1, 2} and {5, 2} can additionally
be used in the same manner. Although it might, on first
consideration, appear that this would reduce the probability
of achieving correct recognition, it has, surprisingly, been
found that [...]
will, in fact, increase the probability of correct recognition.

In addition, account may be taken of the fact that phonemes
which sound alike can be misunderstood by the
device or program which interprets each spoken phoneme.
In order to minimize errors resulting from such incorrect
identification of individual phonemes, all of the entries
associated with a given phoneme position (n) and with a
phoneme (m) which rhymes with the phoneme that the
system determined to have been spoken are also read and a
score is placed in the scoring table for each identifying
number which has been read.

If the spoken word is inputted by spelling that word, then
the speech elements will be letters, i.e. alphabetic characters,
and the sequence of speech elements will be a sequence of
letters in which each letter occupies a particular position. A
simplified example of this implementation will be presented
below, using the words having id# 101-107 of TABLE 1.

In TABLE 3 below, the distribution of identifying numbers
in the second database is illustrated. Thus, if the first
letter of the spoken word is the letter "A", it is only
necessary to interrogate the subset {1,A} of TABLE 3, and
so on for the remaining letters. It will be noted that, for the
sake of simplicity, it has been assumed that each word has
a maximum of seven letters. However, the second database
can be established to identify words having any selected
maximum number of letters.

TABLE 3

      n ->
 m    1                 2             3             4         5         6         7
 A    101,102,106,107   -             101,103,105   106       101,104   -         101,102,106
 B    103               -             -             101       -         -         -
 C    104               -             -             104,107   -         -         -
 D    -                 -             -             -         -         -         -
 E    -                 105           -             -         -         -         105,107
 F    -                 -             -             -         -         -         -
 G    -                 -             -             -         -         104       -
 H    -                 104           -             -         -         -         -
 I    -                 -             102,104,107   -         103       -         -
 ...
 L    -                 101           106           -         -         103,105   -
 M    -                 -             -             -         -         101       -
 N    -                 -             -             -         106       102,107   -
 O    -                 -             -             -         102,107   -         104
 ...
 R    -                 102,103,107   -             -         -         -         -
 S    105               -             -             -         -         -         -
 T    -                 106           -             105       105       106       -
 ...
 Z    -                 -             -             102,103   -         -         -
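
The basic procedure, in which one subset is read for each letter position and a score is accumulated for every id# found, can be sketched as follows for the seven words of TABLE 1. This is an illustrative sketch under stated assumptions, not the program of the APPENDIX: the names WORDS, build_table and identify are hypothetical, each hit is given a score of 1, and the three highest-scoring candidates are returned in descending order.

```python
from collections import defaultdict

# First database (id# -> letter sequence), from TABLE 1.
WORDS = {101: "ALABAMA", 102: "ARIZONA", 103: "BRAZIL", 104: "CHICAGO",
         105: "SEATTLE", 106: "ATLANTA", 107: "ARICONE"}


def build_table(words):
    """Second database: (position n, letter m) -> set of id#s."""
    table = defaultdict(set)
    for word_id, letters in words.items():
        for n, m in enumerate(letters, start=1):
            table[(n, m)].add(word_id)
    return table


def identify(spelled, words, top=3):
    """Read one subset {n,m} per letter and accumulate one point per hit."""
    table = build_table(words)
    scores = dict.fromkeys(words, 0)  # scoring memory, one location per id#
    for n, m in enumerate(spelled, start=1):
        for word_id in table.get((n, m), ()):
            scores[word_id] += 1
    # Sort the scoring memory in descending order of score.
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return [(i, words[i], scores[i]) for i in ranked[:top]]


for candidate in identify("ARIZONA", WORDS):
    print(candidate)
```

For the spelling A-R-I-Z-O-N-A, only seven subsets are read, one per letter, rather than every stored word being compared letter by letter.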



In the example illustrated in TABLE 3, it has again been 
assumed that only one subset (n,m) is addressed for each 
letter of the spoken word. However, it has been found that 
the probability of correct recognition of a word which has 
been inputted in the form of spelling can be enhanced by 
applying two stratagems. 

Firstly, it may occur not infrequently that when one spells 
a word, a letter will be omitted or added. In this case, if only 
the subset contained in one entry of the second database is 
read for each letter of the word to be identified, the prob- 
ability of correct identification is reduced. According to the 
first stratagem relating to this implementation, for a given 
letter position in the word to be interrogated, the entries to 
either side of the correct entry are also read and a score is 
placed in the scoring memory for each of those entries. Thus, 
if what is believed to be the fourth letter of the spoken word
is being considered, and the letter is "A", not only will
subset {4, A} be read, but also subsets {3, A} and {5, A}.
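
The first stratagem can be sketched as follows, again using the words of TABLE 1. The function name score_with_neighbors and the parameter spread are hypothetical, and each hit is scored as 1 for simplicity; weighting of hits is a separate matter. The example spells ARIZONA with the letter I omitted.

```python
from collections import defaultdict

WORDS = {101: "ALABAMA", 102: "ARIZONA", 103: "BRAZIL", 104: "CHICAGO",
         105: "SEATTLE", 106: "ATLANTA", 107: "ARICONE"}


def build_table(words):
    """Second database: (position n, letter m) -> set of id#s."""
    table = defaultdict(set)
    for word_id, letters in words.items():
        for n, m in enumerate(letters, start=1):
            table[(n, m)].add(word_id)
    return table


def score_with_neighbors(spelled, words, spread=1):
    """For each letter, read subset {n,m} and the subsets to either side."""
    table = build_table(words)
    scores = dict.fromkeys(words, 0)
    for n, m in enumerate(spelled, start=1):
        # Columns n-spread .. n+spread; columns outside the table are empty.
        for col in range(n - spread, n + spread + 1):
            for word_id in table.get((col, m), ()):
                scores[word_id] += 1
    return scores


# "ARZONA": the speller has dropped the I of ARIZONA.
strict = score_with_neighbors("ARZONA", WORDS, spread=0)
relaxed = score_with_neighbors("ARZONA", WORDS, spread=1)
best = max(relaxed, key=relaxed.get)
print(strict[102], strict[107])   # tied when only {n,m} is read
print(best, WORDS[best], relaxed[best])
```

With only the exact column read, ARIZONA and ARICONE tie on the first two letters; reading the adjacent columns lets the letters after the omission score for ARIZONA as well.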

According to the second stratagem, account is taken of the 
fact that letters which sound alike can be misunderstood by 
the device or program which interprets each spoken letter. 
For example, "A" can easily be confused with "K" and
"B" can be confused with "C", "D", "E", etc. In order to
minimize errors resulting from such incorrect identification 
of individual letters, all of the entries associated with a given 
letter position (n) and with a letter (m) which rhymes with 
the letter that the system determined to have been spoken are 
also read and a score is placed in the scoring table for each 
identifying number which has been read. 
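
The second stratagem can be sketched in the same way. The rhyme groups below are an illustrative assumption (letter names that rhyme in American English) and are not taken from the APPENDIX; the names RHYME_GROUPS, rhymes_of and score_with_rhymes are hypothetical. The weights of 10 for an exact match and 6 for a rhyming match are borrowed from the weighting example given later in this description.

```python
from collections import defaultdict

WORDS = {101: "ALABAMA", 102: "ARIZONA", 103: "BRAZIL", 104: "CHICAGO",
         105: "SEATTLE", 106: "ATLANTA", 107: "ARICONE"}

# Assumed groups of letters whose spoken names rhyme.
RHYME_GROUPS = [set("AJK"), set("BCDEGPTVZ"), set("IY"), set("QU")]


def build_table(words):
    """Second database: (position n, letter m) -> set of id#s."""
    table = defaultdict(set)
    for word_id, letters in words.items():
        for n, m in enumerate(letters, start=1):
            table[(n, m)].add(word_id)
    return table


def rhymes_of(letter):
    """Letters that rhyme with the given letter, excluding the letter itself."""
    for group in RHYME_GROUPS:
        if letter in group:
            return group - {letter}
    return set()


def score_with_rhymes(heard, words, exact_weight=10, rhyme_weight=6):
    """Read {n,m} for the interpreted letter and {n,r} for each rhyming letter r."""
    table = build_table(words)
    scores = dict.fromkeys(words, 0)
    for n, m in enumerate(heard, start=1):
        for word_id in table.get((n, m), ()):
            scores[word_id] += exact_weight
        for r in rhymes_of(m):
            for word_id in table.get((n, r), ()):
                scores[word_id] += rhyme_weight
    return scores


# The user spells B-R-A-Z-I-L but the first letter is interpreted as "P".
scores = score_with_rhymes("PRAZIL", WORDS)
best = max(scores, key=scores.get)
print(best, WORDS[best], scores[best])
```

Because "P" rhymes with "B", subset {1,B} is read in addition to the (empty) subset {1,P}, so BRAZIL still receives a contribution for the misinterpreted first letter.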

Although, here again, it may, on first consideration, 
appear that these stratagems would reduce the probability of 
correct identification of a spoken word which has been 
inputted by spelling, it has, surprisingly, been found that 
quite the opposite is true. The scores appearing in the scoring
memory for those letters which were incorrectly interpreted 
will invariably be lower than the score for the correct word. 

A scoring memory may be conceptualized as having a
structure as shown below in TABLE 4 in which a score is
accumulated for each word on the basis of the number of
times the id# for that word was read in TABLE 2 or 3
according to one of the procedures described above.



TABLE 4

 id#    Score
 1      Accumulated score for word 1
 2      Accumulated score for word 2
 ...
 K      Accumulated score for word K



For scoring purposes, each time a particular id# is found
in an entry of the second database, this constitutes a scoring
"hit" for that id#. Depending on the stratagem employed, i.e.
taking into account rhyming phonemes or letters or the
addition or deletion of one letter or one or two phonemes in
the spoken word to be identified, several entries may be read
for each speech element position of the word to be identified.
In this case, each "hit" may be weighted on the basis of
various factors.

By way of example, the following scoring scheme may be
employed when the second database is explored, or read, for
evaluating each speech element position of the spoken word
to be identified.

For each id# in an entry at row m which exactly matches 
the associated letter or phoneme of the spoken word, a 
weight of 10 is assigned; for each id# in an entry at a row 
associated with a letter or phoneme which rhymes with that 
at row m, a weight of 6 is assigned. This will apply for id#s 
in the column n associated with the particular spoken word 
speech element position and in each other column of the 
second database (n±1,2) which is read to take account of 
letter or phoneme additions or deletions when the word was 
spoken. For each such id#, a hit value of 1 is assigned; 

then, for each hit in the column n corresponding exactly 
to the spoken word speech element position, a hit value of 
1 is added to the previous hit value; 

if an id# is in the row m which exactly matches the 
associated letter or phoneme of the spoken word, and the 
same id# was in the row which exactly matched the 
associated letter or phoneme of the spoken word for the 
immediately preceding speech element position of the spoken 
word, a further hit value of 1 is added to the previous hit 
value; 

then the total hit value is multiplied by the weight and is 
divided by the number of speech elements, i.e. letters or 
phonemes, in the spoken word. This division assures that 
longer words will not get higher scores just because of their 
length. 

The resulting score is placed in the scoring memory for 
each id#. 

Finally, for each id# whose stored word has a letter or 
phoneme string of a length exactly equal to that of the 
spoken word, an additional score, which may equal 10, is 
added to the scoring memory. 

After all speech element positions of the spoken word 
have thus been compared with the stored data, the scoring 
memory locations containing scores can be sorted in 
descending order of scores and the id#s associated with the 
three highest scores are outputted. 
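The weighting scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of the scheme: the function names, the flag arguments and the sample hits and word lengths are hypothetical, and the second database itself is not modelled. Each hit is assumed to arrive already classified as an exact or rhyming row match, an exact or neighbouring column match, and whether the same id# also matched exactly at the preceding position.

```python
from collections import defaultdict

EXACT_WEIGHT = 10   # id# found in the row of the interpreted letter itself
RHYME_WEIGHT = 6    # id# found in the row of a rhyming letter
LENGTH_BONUS = 10   # added once when stored and spoken lengths are equal


def hit_score(exact_letter, exact_column, follows_exact_match, spoken_len):
    """Score contributed by one hit: weight * hit value / spoken length."""
    weight = EXACT_WEIGHT if exact_letter else RHYME_WEIGHT
    hits = 1                                   # every hit starts at 1
    if exact_column:                           # found in column n itself
        hits += 1
    if exact_letter and follows_exact_match:   # exact match at n-1 as well
        hits += 1
    return weight * hits / spoken_len


def top_three(scores):
    """Return the three (id#, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]


# Hypothetical data for illustration only.
spoken_len = 6
observed_hits = [            # (id#, exact_letter, exact_column, follows_exact_match)
    (1, True, True, False),
    (1, True, True, True),
    (2, False, True, False),
    (3, False, False, False),
    (4, True, False, False),
]
stored_len = {1: 6, 2: 5, 3: 6, 4: 7}

scores = defaultdict(float)  # plays the role of the scoring memory (TABLE 4)
for word_id, exact_letter, exact_column, follows in observed_hits:
    scores[word_id] += hit_score(exact_letter, exact_column, follows, spoken_len)
for word_id in list(scores):
    if stored_len[word_id] == spoken_len:
        scores[word_id] += LENGTH_BONUS

best = top_three(scores)
```

Dividing by the spoken length keeps the per-hit contribution comparable across short and long words, which is the purpose the text gives for the division.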

It will thus be seen that the data processing sequence 
described above is relatively simple. After a spoken word 
has been processed by known techniques to derive a letter or 
phoneme string, or sequence, the letter or phoneme at each 
position of the string is used as a reference to determine 
which entries of the second database are read, scores are 
accumulated in the scoring memory, and after all positions 
of the spoken word string have been considered, the scoring 
memory is sorted. 

According to the present invention, each stored word is 
scored with respect to each recognized, or interpreted, 
speech element, e.g. letter or phoneme, in a manner 
generally similar to that described above, but with relative 
scores based on a determination of interpretation 
probabilities. 

Thus, for each position in the letter or phoneme sequence, 
each stored word is scored on the basis of the identity of the 
interpreted letter or phoneme and the probability that the 
letter or phoneme in that stored word was actually spoken. 

In further accordance with the present invention, the 
probability values are updated while the word recognition 
system is in use, to adapt to inherent imperfections in the 
interpreter and speech characteristics of the user. 






5,748,840 


In preferred embodiments of this invention, as shown in 
FIG. 3, in step S1, each possible combination of a spoken 
speech element α and a recognized, or interpreted, speech 
element β is assigned a probability, P_αβ. This is the 
probability that a speech element has been interpreted as a 
speech element β when a speech element α was spoken. In 
step S2, data representing each word of the plurality of 
words is stored, and in step S3, a sequence of spoken speech 
elements is received and each speech element and its 
position in the sequence of spoken speech elements is 
interpreted. Then, in step S4, the interpreted speech elements 
are compared with stored data representing each word, this 
comparison including performing a computation to identify 
the word or the plurality of words whose speech elements 
correspond most closely to interpreted speech elements. For 
each position in a speech element sequence, when a spoken 
speech element is interpreted as element β, then each stored 
word is given a score corresponding to the specific 
probability, P_αβ, assigned to the combination of the 
interpreted element β and the actual speech element α in the 
same position in the sequence representing that word. Then, 
for each stored word, the scores for all positions are summed 
and the word with the highest total score is identified as the 
spoken word. 

According to an alternative approach, scores may be 
initially summed for all stored words for the first several 
sequence positions and subsequently scoring may continue 
for only a selected number of stored words which had the 
highest scores during initial summing for the first several 
sequence positions. 

The probabilities, P_αβ, used in the practice of the present 
invention may be obtained, by way of example, as follows: 

several different speakers pronounce each letter of the 
alphabet, each speaker pronouncing all letters one or several 
times; 

each pronounced letter is received by the receiver and 
interpreted by the interpreter; 

a count is obtained of each time a spoken letter α was 
interpreted as a letter β, where each of α and β may be any 
letter from A to Z and α may be the same as or different from 
β; and 

for each combination α, β, the probability, P_αβ, is 
obtained by dividing the number of times, N_αβ, that element 
α was interpreted as element β by the total number of times, 
ΣN_β, that spoken elements were interpreted as element β. 

The following Table illustrates some of the data collected 
during a theoretical performance of the above procedure: 

TABLE 

Spoken        Interpreted letter β 
letter α      A      C      D      F      U 

A             89     0      0      1      0 
B             1      5      3      0      0 
C             1      51     0      0      0 
D             0      11     5      1      0 
E             1      4      4      0      0 
L             0      0      0      1      0 
P             1      2      2      0      0 
R             0      0      0      1      0 
S             1      0      0      36     1 
U             0      0      0      0      111 
ΣN_β          113    131    18     163    123 

In the Table, the values for ΣN_β include counts appearing 
in cells which are not shown. Exemplary values for 
probabilities, P_αβ, computed from the above Table include, 
by way of example: 

P_AA = 89/113 = 0.788; 
P_BD = 3/18 = 0.167; 
P_PA = 1/113 = 0.009. 

The meaning of these probabilities is the following: 

P_AA: the spoken letter "A" was interpreted as letter "A" 89 
times, while it occurred 113 times that various spoken letters 
were interpreted as "A". When a spoken letter is interpreted 
as "A", the probability is 89/113 = 0.788 that the spoken 
letter was, in fact, "A"; 

P_BD: the spoken letter "B" was interpreted as "D" 3 
times, while various spoken letters were interpreted as "D" 
18 times. Therefore, when a spoken letter is interpreted as 
"D" the probability that the spoken letter was actually "B" 
is 3/18 = 0.167; 

P_PA: by a similar analysis, when a spoken letter is 
interpreted as "A", the probability that the spoken letter was 
actually "P" is 1/113 = 0.009. 

The data contained in the above Table will now be used 
to demonstrate word identification according to the 
invention. 

Suppose that the words for which data is stored in the 
database include: 
ABACUS 
APPLES 
ABSURD, 
the user spells A-B-A-C-U-S, and the interpreter "hears", or 
interprets, the spelled letters as A-D-A-C-U-F. The task to be 
performed according to the invention is to identify the word 
which was actually spelled. For this purpose, the sequence 
of interpreted elements is compared with the sequence of 
elements of each word for which data is stored in the 
database by retrieving the probability that when the element 
at each position of the sequence of interpreted elements was 
interpreted as element β, the spoken element α was the 
element of the stored word at the same position of the stored 
word element sequence. Then, for each stored word, all 
associated probabilities are summed, ΣP_αβ, and the stored 
word producing the largest sum is judged to be the word for 
which the elements were actually spoken. 

For this example, the probabilities for each element, or 
letter, of each of the stored words are as follows: 

ABACUS 
.788 .167 .788 .389 .902 .221 = 3.255 

APPLES 
.788 .111 .009 .000 .000 .221 = 1.129 

ABSURD 
.788 .167 .011 .000 .000 .006 = 0.972 

Thus, the system indicates that the spoken word was 
ABACUS because ΣP_αβ for the elements of that word is the 
largest. 

According to a further feature of the invention, the data 
employed to calculate probabilities, P_αβ, is updated after 
each word identification process, based on the results of the 
process. Specifically, the identified word is assumed to be 
the word whose elements were spoken by the user, each 
element β of the interpreted word is associated with the 
element α in the same position in the identified word, each 
count in the data Table for which an α, β association exists 
between the elements of the interpreted word and the 
identified word is incremented by one, and new values for 
ΣN_β are computed for each column associated with an 
interpreted element β. 

Thus, in the above example where ABACUS was identified 
from the interpreted word ADACUF, the count N_AA 
will be incremented by 2, and N_BD, N_CC, N_UU and N_SF will 
each be incremented by 1. Correspondingly, ΣN_A will be 
incremented by 2 and each of ΣN_C, ΣN_D, ΣN_F and ΣN_U will 
be incremented by 1. 
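The identification and updating steps described above can be sketched together as follows. This is an illustrative sketch under stated assumptions rather than the system's actual code: the function names are hypothetical, the counts are only the cells shown in the Table, and the column totals ΣN_β are stored separately because they include cells that are not shown. Pairs that never occurred are treated as having a count of zero.

```python
# Counts N_αβ: rows are spoken letters α, columns are interpreted letters β.
COLUMNS = ["A", "C", "D", "F", "U"]
ROWS = {
    "A": [89, 0, 0, 1, 0],
    "B": [1, 5, 3, 0, 0],
    "C": [1, 51, 0, 0, 0],
    "D": [0, 11, 5, 1, 0],
    "E": [1, 4, 4, 0, 0],
    "L": [0, 0, 0, 1, 0],
    "P": [1, 2, 2, 0, 0],
    "R": [0, 0, 0, 1, 0],
    "S": [1, 0, 0, 36, 1],
    "U": [0, 0, 0, 0, 111],
}
counts = {spoken: dict(zip(COLUMNS, row)) for spoken, row in ROWS.items()}
totals = {"A": 113, "C": 131, "D": 18, "F": 163, "U": 123}  # ΣN_β per column


def probability(spoken, interpreted, counts, totals):
    """P_αβ = N_αβ / ΣN_β for spoken element α and interpreted element β."""
    return counts.get(spoken, {}).get(interpreted, 0) / totals[interpreted]


def identify(interpreted, words, counts, totals):
    """Sum P_αβ position by position and return the best word and all sums."""
    sums = {
        word: sum(probability(a, b, counts, totals)
                  for a, b in zip(word, interpreted))
        for word in words
        if len(word) == len(interpreted)  # simplification: equal lengths only
    }
    return max(sums, key=sums.get), sums


def update(identified, interpreted, counts, totals):
    """Increment N_αβ and ΣN_β for every position of the identified word."""
    for spoken, heard in zip(identified, interpreted):
        row = counts.setdefault(spoken, {})
        row[heard] = row.get(heard, 0) + 1
        totals[heard] = totals.get(heard, 0) + 1


word, sums = identify("ADACUF", ["ABACUS", "APPLES", "ABSURD"], counts, totals)
update(word, "ADACUF", counts, totals)
```

Because the sums here are computed from unrounded quotients, they can differ in the third decimal place from sums of the rounded values listed in the text.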

This updating permits a system according to the invention 
to identify a spoken word with greater reliability by 
compensating for imperfections in the interpreter and/or 
speech impediments on the part of the user. For example, 
because of such interpreter imperfections and/or user speech 
impediments, the interpreter may frequently interpret a 
spoken letter, such as "B", as the letter "V". If this should 
occur, the updating procedure will change the data to 
gradually increase the value of P_BV. This will increase the 
likelihood of correctly identifying spoken words containing 
the letter "B". On the other hand, each time the letter "V" is 
correctly identified, N_VV will be incremented. In both cases, 
ΣN_V will be incremented. 

Such a procedure is particularly helpful for a system 
which is to have a single user. If a system is to have several 
users, data for calculating probabilities may be separately 
stored for each user and the system can be provided with a 
selector which is operable by a user to identify that user. 
Based on operation of the selector, the data dedicated to that 
user can be employed for calculation of probabilities and can 
be updated after each use. 
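One way to picture the per-user arrangement is a store that hands out a separate count table for each selected user. The sketch below is only an assumption about how such a selector could be modelled: the class and function names, the user identifiers and the baseline counts are all hypothetical, and seeding each user from a shared baseline is a design choice that the text does not specify.

```python
import copy


class UserProfiles:
    """Keeps one (counts, totals) pair per user, seeded from baseline data."""

    def __init__(self, baseline_counts, baseline_totals):
        self._baseline = (baseline_counts, baseline_totals)
        self._tables = {}

    def select(self, user_id):
        """Return the data dedicated to user_id, creating it on first use."""
        if user_id not in self._tables:
            self._tables[user_id] = copy.deepcopy(self._baseline)
        return self._tables[user_id]


def record(counts, totals, spoken, interpreted):
    """Increment N_αβ and ΣN_β for one spoken/interpreted pair."""
    row = counts.setdefault(spoken, {})
    row[interpreted] = row.get(interpreted, 0) + 1
    totals[interpreted] = totals.get(interpreted, 0) + 1


# Hypothetical baseline counts for two letters only.
profiles = UserProfiles({"B": {"B": 40, "V": 2}, "V": {"V": 38}},
                        {"B": 40, "V": 40})

counts_1, totals_1 = profiles.select("user-1")
record(counts_1, totals_1, "B", "V")   # user-1's "B" was heard as "V"
counts_2, totals_2 = profiles.select("user-2")
```

Updates made for one user leave the other user's data untouched, so each table drifts toward the speech characteristics of its own user.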

The description provided above with respect to 
pronounced, or spoken, letters, is equally applicable to a 
system in which a word itself is spoken and the speech signal 
is divided into phonemes. In either case, use can be made of 
conventional interpreters since the initial speech element 
determination is not a novel feature of the invention. 

A system for identifying words according to the 
invention, which system can be implemented by a suitably 
programmed digital computer, is shown in FIG. 2. This 
system includes a CPU 22 for controlling system operation, 
a first data storage unit 24 storing the data representing the 
plurality of words, a second data storage unit 26 storing the 
data used to calculate the probabilities, P_αβ, a speech 
interface 10 for receiving spoken speech and converting the 
spoken speech elements into digital signals, an interpreter 30 
connected to interface 10 to receive the digital signals and 
interpret those signals to produce an identification of each 
spoken speech element, and a comparison and calculating 
unit 32 for using probabilities, P_αβ, produced from data in 
unit 26, to compare the interpreted speech elements with the 
data stored in unit 24 in order to identify the word 
represented by the stored data which corresponds to the 
spoken speech elements. Units 24 and 26 and interpreter 30 
are connected to supply data and signals to unit 32 under 
control of CPU 22. Unit 32 is further connected to supply 
data relating to the speech elements of identified words, and 
interpreter 30 is connected via CPU 22 to supply data 
relating to interpreted speech elements to unit 26 to allow 
data updates in unit 26. Unit 32 has an output connected to 
an output device, such as speech synthesizer 12, which 
provides the identified word in visible or audible form. 
Finally, an input panel 36 is connected to CPU 22 to permit 
the user to perform appropriate manual control actions. 

While the description above refers to particular 
embodiments of the present invention, it will be understood 
that many modifications may be made without departing from 
the spirit thereof. The accompanying claims are intended to 
cover such modifications as would fall within the true scope 
and spirit of the present invention. 

The presently disclosed embodiments are therefore to be 
considered in all respects as illustrative and not restrictive, 
the scope of the invention being indicated by the appended 
claims, rather than the foregoing description, and all changes 
which come within the meaning and range of equivalency of 
the claims are therefore intended to be embraced therein. 



APPENDIX 

ALGORITHM FOR THE RHYMING SPELLING CHECKER 

Database creation 

'create TABLE 2 

Create name ID table (TABLE 2) by sorting name list 
table (TABLE 1) into letter group subsets 
*** pages 8 and 9 *** 

Spell Check Algorithm 

'create a two dimension RHYME table for the alphabet as 
'follows, placing a zero at the end of each rhyme group 

'letter A's rhymes 
RHYME(1,1)=1    'A rhymes with A 
RHYME(1,2)=8    'H rhymes with A 
RHYME(1,3)=10   'J rhymes with A 
RHYME(1,4)=11   'K rhymes with A 
RHYME(1,5)=0 
RHYME(1,6)=0 
etc. 
RHYME(1,10)=0 
'letter B's rhymes 
RHYME(2,1)=2    'B rhymes with B 
RHYME(2,2)=3    'C rhymes with B 
RHYME(2,3)=4    'D rhymes with B 
RHYME(2,4)=5    'E rhymes with B 
RHYME(2,5)=7    'G rhymes with B 
RHYME(2,6)=16   'P rhymes with B 
RHYME(2,7)=20   'T rhymes with B 
RHYME(2,8)=22   'V rhymes with B 
RHYME(2,9)=26   'Z rhymes with B 
RHYME(2,10)=0 
'letter C's rhymes 
RHYME(3,1)=3    'C rhymes with C 
RHYME(3,2)=2    'B rhymes with C 
RHYME(3,3)=4    'D rhymes with C 
RHYME(3,4)=5    'E rhymes with C 
RHYME(3,5)=7    'G rhymes with C 
RHYME(3,6)=16   'P rhymes with C 
RHYME(3,7)=20   'T rhymes with C 
RHYME(3,8)=22   'V rhymes with C 
RHYME(3,9)=26   'Z rhymes with C 
RHYME(3,10)=0 

'get input string 

'Read a string of L letters from speech recognizer (page 
'12 line 18) 
READ STRING(i) for i = 1 to L 

'For each column in TABLE 2 (page 13) 
FOR n = 1 to L 
   'Read column n of name ID table (TABLE 2) into memory 
   '(page 13) 
   READ COL(m) = TABLE2(n,m) for m = 1 to M 
   'For letters to each side of the column (page 14 line 15) 
   FOR k = n-1 to n+1 
      'Extract the next letter from the string (page 14 line 13) 
      LET = STRING(k) 
      'For each rhyming letter in group m (page 14 line 26) 
      FOR r = 1 to 10 
         'Get the subset number from the rhyme table (page 14 line 26) 
         m = RHYME(LET,r) 
         'If end of rhyme group not reached (page 14 line 26) 
         IF m > 0 THEN 
            'For each name ID in the rhyming group subset (page 14 line 26) 
            FOR g = 1 to number in group m 
               'Extract the name number ID (page 15 line 33) 
               ID = COL(m,g) 
               'Set weight of 10 if exact match or 6 if rhyme (page 15 line 33) 
               IF m = RHYME(LET,1) THEN 
                  WEIGHT = 10 
               ELSE 
                  WEIGHT = 6 
               ENDIF 
               'Initialize the hit count to 1 (page 16 line 2) 
               HITS = 1 
               'Increment the hit count if the exact column (page 16 line 4) 
               IF k = n THEN 
                  HITS = HITS + 1 
               ENDIF 
               'Increment hit count if correct letter sequence (page 15 line 7) 
               IF (n-k) = previous (n-k) for this name ID THEN 
                  HITS = HITS + 1 
               ENDIF 
               'Calculate score per letter (page 16 line 14) 
               SCORE = WEIGHT*HITS/L 
               'Increment the hit table, TABLE 4 (page 16 line 19) 
               TABLE4(ID) = TABLE4(ID) + SCORE 
            NEXT g 
         ENDIF 
      NEXT r 
   NEXT k 
NEXT n 

'search TABLE4 for the three best scores (page 16 line 28) 

FOR i = 1 to end of TABLE4 
   IF TABLE4(i,SCORE) > BestScore(3) THEN 
      BestScore(3) = TABLE4(i,SCORE) 
      BestID(3) = TABLE4(i,ID) 
      IF BestScore(3) > BestScore(2) THEN 
         SWAP BestScore(3), BestScore(2) 
         SWAP BestID(3), BestID(2) 
      ENDIF 
      IF BestScore(2) > BestScore(1) THEN 
         SWAP BestScore(2), BestScore(1) 
         SWAP BestID(2), BestID(1) 
      ENDIF 
   ENDIF 
NEXT i 
What is claimed: 

1. A method for identifying any one of a plurality of words 
using a programmed digital data processing system, each 
word having an audible form represented by a sequence of 
spoken speech elements, with each speech element having a 
respective position in the sequence, the digital data 
processing system being connected to means for receiving 
spoken speech elements of a word and interpreting each 
received speech element, 

wherein there is a plurality of possible speech elements, 
each spoken speech element is a speech element α, 
each interpreted speech element is a speech element β, 
and each spoken speech element α may be interpreted 
as any one of a plurality of different speech elements β, 
one of the speech elements β being the same as speech 
element α, said method comprising: 

assigning to each of the possible speech elements a 
respective plurality of probabilities, P_αβ, that the 
speech element will be interpreted as a speech element 
β when a speech element α has been spoken; 

storing data representing each word of the plurality of 
words, the data for each word including identification 
of each speech element in the word and identification of 
the respective position of each speech element in the 
sequence of speech elements representing the word; 

in the means for receiving and interpreting, receiving a 
sequence of speech elements spoken by a person and 
representing one of the stored words, and interpreting 
each speech element of the spoken word and the 
position of each speech element in the sequence of 
spoken speech elements; and 

comparing the interpreted speech elements with stored 
data representing each word of the plurality of words 
and performing a computation, using the probability, 
P_αβ, associated with each interpreted speech element β 
to identify the word of the plurality of words whose 
speech elements correspond most closely to interpreted 
speech elements. 

2. A method as defined in claim 1 wherein said step of 
performing a computation comprises summing the 
probabilities, P_αβ, associated with the interpreted speech 
elements β of the received sequence of speech elements and 
with the speech elements α in the same positions as the 
interpreted speech elements for at least a number of the 
plurality of words, and determining that word of the number 
of words which is associated with the largest sum. 

3. A method as defined in claim 2 comprising the 
preliminary step of having each of the possible speech 
elements spoken a given number of times, N_α, interpreting 
each spoken speech element in the means for receiving and 
interpreting, determining the number of times, N_αβ, each 
spoken speech element α is interpreted as a speech element 
β, and for each combination of a respective spoken speech 
element α and a respective interpreted speech element β, 
calculating a probability, P_αβ, equal to N_αβ, for α=β, divided 
by the sum of all N_αβ for the respective interpreted speech 
element β and all spoken speech elements α. 

4. A method as defined in claim 1 comprising the further 
step, after said steps of comparing and performing a 
computation, recalculating the probabilities, P_αβ, by 
increasing, by one unit, each N_αβ associated with each 
interpreted speech element β and the speech element α in the 
same position as the interpreted speech element in the 
identified word. 

5. A method as defined in claim 1 wherein each speech 
element is a letter spoken when a word is spelled. 

6. A method as defined in claim 1 wherein each speech 
element is a phoneme pronounced when a word is spoken. 

7. A programmed digital data processing system for 
identifying any one of a plurality of words, each word 
having an audible form represented by a sequence of spoken 
speech elements, with each speech element having a 
respective position in the sequence, wherein there is a 
plurality of possible speech elements, each spoken speech 
element is a speech element α, each interpreted speech 
element is a speech element β, and each spoken speech 
element α may be interpreted as any one of a plurality of 
different speech elements β, one of the speech elements β 
being the same as speech element α, said apparatus 
comprising: 

first data storage means for storing, for each of the 
possible speech elements, a respective plurality of 
probabilities, P_αβ, that the speech element will be 
interpreted as a speech element β when a speech 
element α has been spoken; 

second data storage means for storing data representing 
each word of the plurality of words, the data for each 
word including identification of each speech element in 
the word and identification of the respective position of 
each speech element in the sequence of speech 
elements representing the word; 

means for receiving a sequence of speech elements 
spoken by a person and representing one of the stored 
words, and for interpreting each speech element of the 
spoken word and the position of each speech element in 
the sequence of spoken speech elements; and 

means connected for comparing the interpreted speech 
elements with stored data representing each word of the 
plurality of words and performing a computation, using 
the probability, P_αβ, associated with each interpreted 
speech element β to identify the word of the plurality 
of words whose speech elements correspond most 
closely to interpreted speech elements. 

8. A system as defined in claim 7 wherein said means for 
comparing and performing a computation comprise means 
for summing the probabilities, P_αβ, associated with the 
interpreted speech elements β of the received sequence of 
speech elements and with the speech elements α in the same 
positions as the interpreted speech elements for at least a 
number of the plurality of words, and means for determining 
that word of the number of words which is associated with 
the largest sum. 

9. A system as defined in claim 8 further comprising 
means for performing a preliminary step of having each of 
the possible speech elements spoken a given number of 
times, N_α, interpreting each spoken speech element in the 
means for receiving and interpreting, determining the 
number of times, N_αβ, each spoken speech element α is 
interpreted as a speech element β, and for each combination 
of a respective spoken speech element α and a respective 
interpreted speech element β, calculating a probability, P_αβ, 
equal to N_αβ, for α=β, divided by the sum of all N_αβ for the 
respective interpreted speech element β and all spoken 
speech elements α. 

10. A system as defined in claim 7 further comprising 
means for recalculating the probabilities, P_αβ, by increasing, 
by one unit, each N_αβ associated with each interpreted 
speech element β and the speech element α in the same 
position as the interpreted speech element in the identified 
word. 

11. A method for identifying any one of a plurality of 
words using a programmed digital computing system, each 
word having an audible form representable by a sequence of 
speech elements each having a respective position in the 
sequence, wherein each speech element has at least one 
identifiable acoustic characteristic and a plurality of the 
speech elements are substantially identical with respect to 
the at least one identifiable acoustic characteristic, said 
method comprising: 

storing, in the digital computing system, a digital 
representation corresponding to each of the plurality of 
words; 

receiving a sequence of speech elements spoken by a 
person and representing the audible form of one of the 
plurality of words, and storing representations of the 
received speech elements and their respective positions 
in the spoken sequence; 

at each position in the spoken sequence, determining each 
speech element, other than the speech element for 
which a representation is stored, which is substantially 
identical to the speech element for which a 
representation is stored with respect to the at least one 
identifiable acoustic characteristic; 

comparing combinations of speech elements for which 
representations are stored and determined speech 
elements for a word with stored words; and 

identifying the stored word for which the comparison 
produces the best match with one of the combinations 
of speech elements. 

12. A method as defined in claim 11 further comprising 
reproducing the stored word which is identified in said 
identifying step. 